If, for example, a support forum on the site has numerous duplicate pages with different URLs (reply links, reply-with-quote links), it is desirable to exclude the duplicates from the index at the crawling stage.
Suppose that the links are as follows:
http://localhost/forum.aspx?action=reply&mesgid=1
http://localhost/forum.aspx?action=view&mesgid=1
http://localhost/forum.aspx?action=reply&mesgid=2
http://localhost/forum.aspx?action=view&mesgid=2
and that the 'reply' pages contain the same text as the 'view' pages. To prevent the 'reply' pages from being added to the index at the crawl stage, we can add an element to the 'Path matches to be ignored' collection: "forum.aspx?action=reply". Any URL containing that text will then not be added. Because it is a collection, it can hold multiple URL segments to be ignored.
Note: if an index has already been created, setting this has no effect on documents already in the index, so if necessary delete the index and recrawl.
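As a minimal sketch, the ignore collection for the forum example above could be built in code and then passed to the Import call shown at the end of this section. The collection type used here (ArrayList) is an assumption; check the WebsiteBasedIndexableSourceRecord constructor in your version for the exact type it expects.
// Sketch only: the list type expected by WebsiteBasedIndexableSourceRecord may differ.
System.Collections.ArrayList pathMatchesToBeIgnoredList = new System.Collections.ArrayList();
pathMatchesToBeIgnoredList.Add("forum.aspx?action=reply");   // skip any URL containing this text
System.Collections.ArrayList pathMatchesToBeIncludedList = null; // no 'include' filtering in this example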
Individual pages can also be excluded by placing the standard robots meta tag in the page:
<meta name="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"/>
To ignore the robots meta tag, set RespectsRobotsMetaTags to false in the configuration.
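For example, assuming the configuration object passed to DocumentIndex (named Configuration in the code below) exposes RespectsRobotsMetaTags as a boolean property, which is an assumption about where the setting lives, it could be disabled before indexing:
// Assumption: RespectsRobotsMetaTags is a boolean property on the configuration object;
// verify its exact location in your version of the API.
Configuration.RespectsRobotsMetaTags = false;   // crawl and index pages even if they carry a robots meta tag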
A Keyoti-specific version of the tag is also recognized:
<meta name="Keyoti_ROBOTS" CONTENT="NOINDEX, NOFOLLOW"/>
The configuration parameter RespectsRobotsMetaTags has no effect on whether this tag is observed.
DocumentIndex imp = new DocumentIndex(Configuration);
urlStrings = imp.Import(new WebsiteBasedIndexableSourceRecord(startUrlString, pathMatchesToBeIgnoredList, pathMatchesToBeIncludedList));
imp.Close();
(pathMatchesToBeIgnoredList and pathMatchesToBeIncludedList can be null/Nothing)
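To apply the forum example above, pass the list containing "forum.aspx?action=reply" as pathMatchesToBeIgnoredList; any URL containing that fragment should then be skipped during the crawl. Passing null/Nothing for both lists simply applies no path-based filtering to the pages reachable from startUrlString.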